REPORT LAYOUT: - Introduction - General Analysis and insights (try to find unique insights) - Analysis of factors affecting revenue - ML and prediction of select movies

Executive Summary

Project Framework: Definition and Methodology

In this report, we explore the factors that influence the monetary success of a movie at the box office. Our investigation extends beyond purely fiscal considerations to include other important factors, such as the expertise of the cast and crew. By examining these components, the report aims to provide a comprehensive understanding of what defines a movie’s monetary success at the box office.

The data was obtained with our own web scraping algorithm and covers the 75 top-grossing movies of each year over the past 25 years.

Temporal Analysis

Average revenue shows a distinct upward trend over time, with foreign revenue growing faster than domestic revenue. The surge in global revenue is driven primarily by this rapid expansion of foreign revenue, which reflects the growing acceptance of Western films in international markets.

The onset of the Covid-19 pandemic significantly impacted the film industry, as is evident in the graph. Productions were halted and theaters closed, leading to a substantial loss of earning potential. Lockdown measures around the world disrupted filming schedules and postponed releases, while the closure of theaters eliminated a crucial avenue for revenue. This had a ripple effect across the industry, affecting filmmakers, actors, crew members, distributors, and exhibitors. The industry’s vulnerability to external shocks became apparent, prompting innovative adaptations such as online releases.

The month of release also has a notable effect. Films released in May and June consistently outperform those released in other months, and an analysis of variance (ANOVA) shows a significant difference in average revenue across release months. Several factors contribute to this phenomenon:

  1. Summer Blockbuster Season: May and June fall within the traditional summer movie season in numerous regions. Studios strategically unveil high-budget blockbuster films during this period, targeting a broad audience. The warmer weather and school vacations further boost movie attendance.

  2. Strategic Release Patterns: The film industry acknowledges this pattern, leading to a clustering effect. Recognizing the advantageous months, more popular and anticipated films tend to be strategically released during May and June. This intentional scheduling capitalizes on the observed heightened audience engagement during these months.

  3. Genre Preferences: Certain movie genres, such as action, adventure, and fantasy, are often associated with May and June releases as seen in the graph. These genres tend to draw larger audiences and generate higher revenue, contributing to the observed pattern. (Median Revenue used to account for outliers)

    term         df          sumsq         meansq   statistic   p.value
    Month        11   8.114459e+18   7.376781e+17    10.76122         0
    Residuals  1848   1.266797e+20   6.854964e+16          NA        NA
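As a rough illustration of how a one-way ANOVA like the one above can be run, the following sketch uses base R on simulated data. The data frame `films`, its columns, and the revenue values are hypothetical stand-ins and are not taken from the report's data set.

```r
set.seed(42)

# Toy data (hypothetical): 12 release months x 30 films each,
# with May and June given a higher mean revenue
month <- factor(rep(month.abb, each = 30), levels = month.abb)
base_mean <- ifelse(month %in% c("May", "Jun"), 300, 200)
films <- data.frame(
  month = month,
  revenue = rnorm(length(month), mean = base_mean, sd = 50)
)

# One-way ANOVA: does mean revenue differ across release months?
fit <- aov(revenue ~ month, data = films)
anova_table <- summary(fit)[[1]]
print(anova_table)

# First row is the month term, second row is the residuals
f_stat <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
```

A small p-value only indicates that at least one month differs from the others. It does not say which months differ, so a follow-up comparison such as `TukeyHSD(fit)` would be needed for that.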

Another notable observation is the seasonality exhibited in the average revenue over a year. The seasonal strength, quantified by a value of 0.5817723, signifies a substantial recurring pattern within our data set.

This seasonality implies that revenue follows patterns that recur every year: certain times of the year consistently bring higher or lower revenue. Understanding and leveraging this pattern can be pivotal for strategic decisions about film releases.

In practical terms, this finding prompts a closer examination of the temporal distribution of revenue throughout the year. A more detailed exploration of which months or seasons contribute significantly to high or low average revenues can unveil insights that may guide release strategies, marketing efforts, or resource allocation.

trend_strength seasonal_strength_year seasonal_peak_year seasonal_trough_year spikiness linearity curvature stl_e_acf1 stl_e_acf10
0.5208086 0.5817723 5 8 8.034234e+27 1142997602 -136943061 -0.0645241 0.0550351
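The column names above resemble STL-based time series features, so it is assumed here that the strengths come from an STL decomposition. Under that assumption, a common definition of seasonal strength is max(0, 1 - Var(remainder) / Var(seasonal + remainder)), with trend strength defined analogously. The sketch below applies it to a simulated monthly series using base R; the series itself is hypothetical.

```r
set.seed(1)

# Toy monthly series (hypothetical): linear trend + yearly cycle + noise
months <- 1:120
values <- 100 + 0.5 * months + 20 * sin(2 * pi * months / 12) + rnorm(120, sd = 5)
y <- ts(values, frequency = 12)

# STL decomposition into seasonal, trend, and remainder components
fit <- stl(y, s.window = "periodic")
seasonal <- as.numeric(fit$time.series[, "seasonal"])
trend <- as.numeric(fit$time.series[, "trend"])
remainder <- as.numeric(fit$time.series[, "remainder"])

# Strength measures are bounded between 0 and 1
seasonal_strength <- max(0, 1 - var(remainder) / var(seasonal + remainder))
trend_strength <- max(0, 1 - var(remainder) / var(trend + remainder))

print(c(seasonal = seasonal_strength, trend = trend_strength))
```

Values near 1 mean the component dominates the remainder, while values near 0 mean it is hard to distinguish from noise.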

Film Characteristics and Insights

A film’s budget is the single biggest influence on its success, as it governs most aspects of production. Substantial financial backing allows for elevated production values, sophisticated marketing strategies, and the recruitment of established talent, all of which contribute to a film’s overall quality and marketability. The slope of the regression line illustrates this relationship and the large influence of budget on revenue.
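To make the idea of the regression slope concrete, here is a minimal sketch of a simple linear regression of revenue on budget. The data are simulated, and the variable names and the true slope of 2.5 are assumptions chosen for illustration only.

```r
set.seed(3)

# Toy data (hypothetical): revenue rises by about 2.5 per unit of budget
budget <- runif(100, min = 10, max = 200)
revenue <- 50 + 2.5 * budget + rnorm(100, sd = 20)
toy_films <- data.frame(budget, revenue)

# Ordinary least squares fit of revenue on budget
fit <- lm(revenue ~ budget, data = toy_films)

# The slope estimates the average change in revenue per unit of budget
slope <- unname(coef(fit)["budget"])
print(slope)
```

On real box office data the relationship is noisier and often right-skewed, so a log transformation of both variables may be worth considering before fitting.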

Over the period studied, the average run time of films has gradually increased. Longer films also tend to earn more, although the difference between short and medium-length films is less pronounced. Films were split into the following categories:

  • Short: Less than 90 Minutes
  • Medium: 90 to 119 Minutes
  • Long: 120+ Minutes
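The categories above can be assigned with `cut()`. This is a small sketch on a hypothetical vector of run times, not the report's actual code.

```r
# Hypothetical run times in minutes
runtime <- c(85, 90, 105, 119, 120, 150)

# right = FALSE makes each interval closed on the left:
# [-Inf, 90) Short, [90, 120) Medium, [120, Inf) Long
category <- cut(
  runtime,
  breaks = c(-Inf, 90, 120, Inf),
  labels = c("Short", "Medium", "Long"),
  right = FALSE
)

print(data.frame(runtime, category))
```

With these breaks, a 90-minute film counts as Medium and a 120-minute film counts as Long.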

The rating of a film is important as it dictates the specific demographics to which the film is likely to appeal. While the rating, in general, might not wield a substantial influence on the overall revenue of a film, a notable exception is observed with the “R” rating. Films carrying an “R” rating exhibit a significant decrease in average revenue, aligning with the overarching understanding that “R” rated movies cater to a comparatively smaller demographic, thus potentially limiting their market reach. This distinctive trend highlights the impact of content restrictions on audience accessibility and the subsequent financial performance of a film.

The influence of a film distributor on its earning potential is a pivotal factor in the cinematic landscape. One standout performer in this realm is Walt Disney Studios Motion Pictures, demonstrating a consistent track record of increasing the revenue of the films distributed over time. This commendable trend not only positions Disney as a powerhouse in film distribution but also emphasizes the strategic vision and market awareness that the studio brings to the table.

Walt Disney Studios Motion Pictures has distinguished itself by not only delivering successful individual film releases but also by fostering a cumulative improvement in revenue trends across its portfolio. This sustained success suggests a combination of effective marketing strategies, adept distribution planning, and a keen understanding of audience preferences. The studio’s ability to not only maintain but enhance its films’ revenue trajectories points to a dynamic and forward-thinking approach in navigating the ever-evolving landscape of the film industry.

The genre of a film defines its style, tone, and overall artistic expression. It serves as a blueprint, giving audiences a general idea of what to expect and helping filmmakers convey their vision effectively. Genre is also central to marketing and promotion, as it helps studios target specific demographics and tailor promotional campaigns to reach the intended audience.

As illustrated in the chart, Sci-Fi and Adventure emerge as the most lucrative genres within the film industry. This can be primarily attributed to the presence of many blockbuster titles within these specific genres. Despite the presence of outliers, which could represent exceptional cases or singular phenomena, the overarching trend reflected in the chart suggests a consistent and widespread favoritism towards Sci-Fi and Adventure genres. This pattern implies that audiences are consistently drawn to these genres, reinforcing their status as the forefront contributors to the film industry’s financial success.

Top 10 Genres by Revenue
Genre Avg Revenue
Sci-Fi 390186395
Adventure 376108863
Musical 336913540
Fantasy 329471252
Animation 327635090
Action 318624319
Family 312315794
Comedy 230154236
Thriller 222654160
Mystery 217913536

cast <- df %>% 
  dplyr::select(Title, Worldwide, Cast) %>% 
  tidyr::separate_rows(Cast, sep = ", ") %>% 
  dplyr::group_by(Cast) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::mutate(Movies_Acted = dplyr::case_when(
    Count >= 5 & Count <= 10 ~ '5-10',
    Count > 10 & Count <= 15 ~ '10-15',
    Count > 15 & Count <= 20 ~ '15-20',
    Count > 20 ~ '20+',
    TRUE ~ 'Less than 5'
  )) %>% 
  dplyr::group_by(Movies_Acted) %>% 
  dplyr::summarise(AvgRev = mean(AvgRev),
                   Count = n()) %>% 
  dplyr::arrange(desc(AvgRev))
cast
## # A tibble: 5 × 3
##   Movies_Acted     AvgRev Count
##   <chr>             <dbl> <int>
## 1 10-15        340892604.    52
## 2 20+          325170485.     9
## 3 5-10         297994310.   291
## 4 15-20        291142359.    31
## 5 Less than 5  200239184.  2763
# Possible interpretation: audiences may increasingly prefer newer faces or different kinds of storytelling, but still favor established regulars
range_order <- c("Less than 5", "5-10", "10-15", "15-20", "20+")
cast$Movies_Acted <- factor(cast$Movies_Acted, levels = range_order)
ggplot(cast, aes(x = Movies_Acted, y = AvgRev, fill = factor(Count))) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  scale_fill_viridis_d() +
  labs(title = "Average Revenue by Number of Movies Acted",
       x = "Movies_Acted",
       y = "Average Revenue",
       fill = "Count") +
  theme_minimal()

star <- df %>% 
  dplyr::select(Worldwide, Star) %>% 
  dplyr::group_by(Star) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::filter(Count >= 5) %>% 
  dplyr::arrange(desc(AvgRev))
star
## # A tibble: 93 × 3
##    Star                   AvgRev Count
##    <chr>                   <dbl> <int>
##  1 Robert Downey Jr. 1065872463.    11
##  2 Chris Pratt        942798923.     9
##  3 Tom Holland        894681963.     5
##  4 Daniel Radcliffe   873331103.     9
##  5 Elijah Wood        700348692.     5
##  6 Daniel Craig       584913297.     8
##  7 Jing Wu            579671044.     6
##  8 Tobey Maguire      548438356.     5
##  9 Kristen Stewart    542133565.     7
## 10 Chris Hemsworth    521315834      6
## # ℹ 83 more rows
star_plot <- star %>%  
  ggplot(aes(x = Count, y = AvgRev, size = Count, color = Count,
             text = paste("Star:", Star, "<br>Number of Movies:", Count, "<br>Average Revenue:", scales::dollar(AvgRev)))) +
  geom_point() +
  labs(title = "Movie Stars and Avg Revenue",
       x = "Number of Movies",
       y = "Average Revenue",
       size = "Number of Movies")
plotly::ggplotly(star_plot, tooltip = "text")
#RUN REGRESSION ANALYSIS

writer <- df %>% 
  dplyr::select(Title, Worldwide, Writer) %>% 
  stats::na.omit() %>% 
  dplyr::mutate(Writer_Count = stringr::str_count(Writer, ",") + 1) %>%
  dplyr::mutate(Grouped_Writer_Count = ifelse(Writer_Count >= 10, 10, Writer_Count)) %>%
  dplyr::group_by(Grouped_Writer_Count) %>% 
  dplyr::summarise(AvgRevenue = mean(Worldwide),
                   Count = n())

writer
## # A tibble: 10 × 3
##    Grouped_Writer_Count AvgRevenue Count
##                   <dbl>      <dbl> <int>
##  1                    1 192412071.   316
##  2                    2 227805328.   434
##  3                    3 210036030.   352
##  4                    4 282824821.   255
##  5                    5 301414854.   184
##  6                    6 307590921.   118
##  7                    7 369320972.    71
##  8                    8 390427176.    37
##  9                    9 363805457     39
## 10                   10 477376779.    52
# FIND AND GRAPH THE TOP DIRECTORS
# DIRECTOR GENRE GRAPH

director <- df %>% 
  dplyr::select(Title, Worldwide, Director) %>% 
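  dplyr::mutate(Director_Count = stringr::str_count(Director, ",") + 1) %>% 
  dplyr::group_by(Director_Count) %>% 
  # Note: AvgRevenue holds the median here, which is less sensitive to outliers than the mean
  dplyr::summarise(AvgRevenue = median(Worldwide),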
                   Count = n()) %>% 
  dplyr::filter(Count > 2)

director
## # A tibble: 4 × 3
##   Director_Count AvgRevenue Count
##            <dbl>      <dbl> <int>
## 1              1  162091208  1641
## 2              2  180513586   187
## 3              3  256786742    25
## 4              4  191439347     6

Predictive Analysis

Building on the insights from the preceding analysis, we now turn to predictive modeling. The objective is to project the global box office revenue of upcoming, yet-to-be-released movies.

Regression Tree Analysis

Here, we train a regression tree to forecast the global revenue of an upcoming release, using the influential factors identified in the preceding analysis.

The model incorporates budget, distributor, release month, MPAA rating, runtime in minutes, primary genre, and the count of genres.

Before examining the tree, let’s look at how the model assigns importance to the predictor variables. Unsurprisingly, budget emerges as the top indicator of movie revenue, followed by release month, distributor, and the other variables. The count of genres also contributes to the model, which suggests that it has some influence on revenue.
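For readers unfamiliar with how a regression tree ranks its predictors, the following sketch fits an `rpart` tree to simulated data and reads its `variable.importance` component. The data frame `toy`, its columns, and the coefficients are hypothetical and only loosely mirror the predictors named above.

```r
library(rpart)

set.seed(7)
n <- 300

# Toy predictors (hypothetical names and values, not the report's data)
toy <- data.frame(
  budget = runif(n, 10, 250),
  runtime = runif(n, 80, 180),
  count_genres = sample(1:4, n, replace = TRUE)
)

# Revenue depends strongly on budget, weakly on runtime, not on count_genres
toy$revenue <- 3 * toy$budget + 0.3 * toy$runtime + rnorm(n, sd = 40)

# Regression tree (method = "anova" for a numeric outcome)
tree <- rpart(revenue ~ budget + runtime + count_genres, data = toy, method = "anova")

# Importance per predictor, sorted and rescaled to sum to 1
importance <- sort(tree$variable.importance, decreasing = TRUE)
print(round(importance / sum(importance), 3))
```

In `rpart`, importance accumulates the improvement from splits where a variable is the primary splitter plus an adjusted share from surrogate splits, so a variable can receive a non-zero score even if it never appears in the plotted tree.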

Below is the visualization of our decision tree and model performance metrics after running our test set.

Model Performance Metrics
.metric .estimator .estimate
rmse standard 2.103495e+08
rsq standard 2.969080e-01
mae standard 1.309467e+08
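For reference, the three metrics in the table can be computed by hand as in this sketch. The `actual` and `predicted` vectors are made-up numbers. The R-squared here is the squared correlation between truth and estimate, which is assumed to match how the `rsq` row above was computed.

```r
# Hypothetical actual and predicted values
actual <- c(100, 200, 300, 400)
predicted <- c(110, 190, 320, 370)

errors <- actual - predicted

# Root mean squared error: penalises large misses more heavily
rmse <- sqrt(mean(errors^2))

# Mean absolute error: typical size of a miss, in the outcome's units
mae <- mean(abs(errors))

# R-squared as the squared correlation between actual and predicted
rsq <- cor(actual, predicted)^2

print(c(rmse = rmse, rsq = rsq, mae = mae))
```

Because RMSE squares the errors, a few very large misses (such as blockbuster outliers) can push it well above MAE.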

Error Analysis

The model’s performance metrics suggest suboptimal accuracy in predicting test set revenue. The mean absolute error (MAE) is about 131 million, signifying notable deviations from actual values. Additionally, the R-squared value of about 0.30 indicates that our factors explain only about 30% of the variance in revenue. To explore the model’s limitations, we focus on outliers, identifying where the model struggles and where it could be improved. The plot below depicts the relationship between residuals and actual values, highlighting instances of significant prediction deviations.

The chart suggests outliers in predictions, particularly in high-revenue areas.

Next, we analyze the residuals by budget, the most influential predictor. We classify each estimate as “close” or “bad” using deliberately generous thresholds. A boxplot of budget for each group shows where our estimates succeed (budgets around $50 million) and where they struggle (budgets above $150 million), possibly because higher-budget films are shaped by additional factors that the model does not capture.

## # A tibble: 10 × 4
##     Worldwide   predicted   residuals MovieName                     
##         <dbl>       <dbl>       <dbl> <chr>                         
##  1 1346913161  179967611. 1166945550. Black Panther                 
##  2 1242805359  179967611. 1062837748. Incredibles 2                 
##  3 1450026933  448968713. 1001058220. Frozen II                     
##  4 1078751311  179967611.  898783700. Joker                         
##  5 1028570889  179967611.  848603278. Finding Dory                  
##  6 1023784195  179967611.  843816584. Zootopia                      
##  7  970766005  179967611.  790798394. Despicable Me 2               
##  8  950550490  218163671.  732386819. Oppenheimer                   
##  9 1153296293 1824677508. -671381215. Captain America: Civil War    
## 10 1308467944  674302171.  634165773. Jurassic World: Fallen Kingdom
## # A tibble: 10 × 4
##    Worldwide  predicted residuals MovieName                                
##        <dbl>      <dbl>     <dbl> <chr>                                    
##  1 179769457 179967611.  -198154. The Post                                 
##  2 180563636 179967611.   596025. Hacksaw Ridge                            
##  3 179246868 179967611.  -720743. Allegiant                                
##  4 180899045 179967611.   931434. Kill Bill: Vol. 1                        
##  5 181216833 179967611.  1249222. Scooby-Doo 2: Monsters Unleashed         
##  6 177856751 179967611. -2110860. Baywatch                                 
##  7 333535934 337193601. -3657667. Fantastic Four                           
##  8 176104344 179967611. -3863267. Dr. Dolittle 2                           
##  9 175835580 179967611. -4132031. The Monkey King: Havoc in Heaven's Palace
## 10 175492224 179967611. -4475387. Suzume
## # A tibble: 373 × 3
##    Worldwide  predicted   residuals
##        <dbl>      <dbl>       <dbl>
##  1 950550490 218163671.  732386819.
##  2 569626289 179967611.  389658678.
##  3 283246920 179967611.  103279309.
##  4 208177026 179967611.   28209415.
##  5 194131360 179967611.   14163749.
##  6 122031954 179967611.  -57935657.
##  7  90060106 179967611.  -89907505.
##  8  83038729 179967611.  -96928882.
##  9  74191973 179967611. -105775638.
## 10  60730568 179967611. -119237043.
## # ℹ 363 more rows

library(dplyr)
library(parsnip)
library(ggplot2)

# Restrict the test set to films with budgets under $125 million
df_test_new <- df_test %>%
  dplyr::filter(Budget < 125000000)

# Apply the preprocessing recipe to the filtered test set
test_baked <- recipes::bake(recipe_pipeline, df_test_new)

# Fit a regression tree on the preprocessed training data
model <- parsnip::decision_tree(mode = "regression") %>%
  parsnip::set_engine("rpart") %>%
  parsnip::fit(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre` + `count_genres`, data = train_baked)

# Predict on the filtered test set and pair predictions with actual values
res <- predict(model, new_data = test_baked) %>%
  dplyr::bind_cols(test_baked %>% dplyr::select(Worldwide))

# Summarise RMSE, R-squared, and MAE for the filtered test set
metrics_table_2 <- res %>% 
  yardstick::metrics(truth = Worldwide, estimate = .pred) %>%
  knitr::kable(align = "c", caption = "Model Performance Metrics")

# Print the new table (the original printed the earlier metrics_table by mistake)
metrics_table_2
##      Worldwide  predicted  residuals      Budget Run Time (Mins) count_genres
## 1 0.0001114872 0.07173429 0.06153934 0.003145801       0.5069284    0.1898288